Evaluating OCR and Non - OCR Text

نویسندگان

  • Markus Junker
  • Rainer Hoch
چکیده

In literature, many feature types and learning algorithms are proposed for document classiication. However , an extensive and systematic evaluation of the various approaches has not been done yet. In order to investigate diierent text representations for document classiication, we have developed a tool which transforms documents into feature-value representations suitable for standard learning algorithms. In this paper we investigate seven document representations for German texts based on n-grams and single words. We compare their eeectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that the use of n-grams is an attractive technique which can even compare to techniques relying on a morphological analysis. This holds for OCR texts as well as for correct ASCII texts.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Non-interactive OCR Post-correction for Giga-Scale Digitization Projects

This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, contemporary and historical. Text-Induced Corpus Clean-up or ticcl (pronounce ’tickle’) focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants for any particular focus word that lie within the predefined Leven...

متن کامل

Cvl Ocr Db, an Annotated Image Database of Text in Natural Scenes, and Its Usability

Text detection and optical character recognition (OCR) in images of natural scenes is a fairly new computer vision area but yet very useful in numerous applicative areas. Although many implementations gain promising results, they are evaluated mostly on the private image collections that are very hard or even impossible to get. Therefore, it is very difficult to compare them objectively. Since ...

متن کامل

Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)

Document images produced by scanner or digital camera, usually suffer from geometric and photometric distortions. Both of them deteriorate the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions aiming to improve OCR results. Our methodology is based on finding text lines by dynamic local connectivity map and then applying a l...

متن کامل

OCR Post-Processing Error Correction Algorithm Using Google's Online Spelling Suggestion

With the advent of digital optical scanners, a lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into an electronic version that can be manipulated by a computer. For this purpose, OCR, short for Optical Character Recognition was developed to translate scanned graphical text into editable computer text. Unfortunately, OCR is still imperfect as it occa...

متن کامل

OCR Post-Processing Error Correction Algorithm using Google Online Spelling Suggestion

With the advent of digital optical scanners, a lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into an electronic version that can be manipulated by a computer. For this purpose, OCR, short for Optical Character Recognition was developed to translate scanned graphical text into editable computer text. Unfortunately, OCR is still imperfect as it occa...

متن کامل

Evaluating supervised topic models in the presence of OCR errors

Supervised topic models are promising tools for text analytics that simultaneously model topical patterns in document collections and relationships between those topics and document metadata, such as timestamps. We examine empirically the effect of OCR noise on the ability of supervised topic models to produce high quality output through a series of experiments in which we evaluate three superv...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1997